 inception score


Token Perturbation Guidance for Diffusion Models

Neural Information Processing Systems

Classifier-free guidance (CFG) has become an essential component of modern diffusion models to enhance both generation quality and alignment with input conditions. However, CFG requires specific training procedures and is limited to conditional generation. To address these limitations, we propose Token Perturbation Guidance (TPG), a novel method that applies perturbation matrices directly to intermediate token representations within the diffusion network. TPG employs a norm-preserving shuffling operation to provide effective and stable guidance signals that improve generation quality without architectural changes. As a result, TPG is training-free and agnostic to input conditions, making it readily applicable to both conditional and unconditional generation. We further analyze the guidance term provided by TPG and show that its effect on sampling resembles CFG more closely than existing training-free guidance techniques do. Extensive experiments on SDXL and Stable Diffusion 2.1 show that TPG achieves nearly a 2× improvement in FID for unconditional generation over the SDXL baseline, while closely matching CFG in prompt alignment. These results establish TPG as a general, condition-agnostic guidance method that brings CFG-like benefits to a broader class of diffusion models.
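
To make the idea of a norm-preserving shuffle concrete, the following is a minimal sketch rather than the authors' implementation. It assumes tokens are stored as rows of a small matrix, that the perturbation is a random row permutation, and that the guided prediction is formed by a CFG-style extrapolation away from the perturbed prediction. The names `shuffle_tokens`, `frobenius_norm`, and `guided_prediction` are hypothetical, and the abstract does not state where in the network the shuffle is applied.

```python
import math
import random


def shuffle_tokens(tokens, rng):
    """Return the token rows in a random order (a row permutation)."""
    order = list(range(len(tokens)))
    rng.shuffle(order)
    return [tokens[i] for i in order]


def frobenius_norm(tokens):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in tokens for v in row))


def guided_prediction(pred_clean, pred_perturbed, scale):
    """Assumed CFG-style rule: start at the perturbed prediction and
    extrapolate toward (and past) the clean prediction."""
    return [p + scale * (c - p) for c, p in zip(pred_clean, pred_perturbed)]


rng = random.Random(0)
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
shuffled = shuffle_tokens(tokens, rng)

# A permutation only reorders rows, so the overall norm is unchanged.
print(frobenius_norm(tokens), frobenius_norm(shuffled))
```

Because a permutation reorders entries without rescaling them, the perturbed tokens keep the same overall magnitude as the originals, which is the sense in which the shuffle is norm-preserving.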


The proposition makes use of the following observation: for the discriminator defined in (1), the norm of the gradient for $w_t$ is upper bounded by $\|\nabla_{w_t} D_\theta(x)\|_F \le \|x\| \cdot \prod^{L} \ldots$

Neural Information Processing Systems

The upper bound on the Frobenius norm of the gradient for spectrally-normalized discriminators follows directly. As $l_w(x)$ is a linear transformation, we have $l_{cw}(x) = c\, l_w(x)$ and $l_w(cx) = c\, l_w(x)$. Moreover, since ReLU and leaky ReLU are linear in the $\mathbb{R}^+$ and $\mathbb{R}^-$ regions, we have $a_i(cx) = c\, a_i(x)$. In this section we discuss the gradients with respect to the actual parameter $w_i$. From Eq. (12) in [30] we know $\nabla_{w_t} D_\theta(x) = A$, and we know that $\|\nabla_{w'_t} D_\theta(x)\|_F$, $\nabla_{o^t_l(x)} D_\theta(x)$, and $\|o^t_l(x)\|$ have upper bounds. From Theorem 1.1 in [44] we know that if $w_t$ is initialized with i.i.d. random variables from a uniform or Gaussian distribution, $\mathbb{E}\,\|w_t\|_{sp}$ is lower bounded away from zero at initialization. So $\|\nabla_{w_t} D_\theta(x)\|_F$ is upper bounded at initialization. Moreover, we observe empirically that $\|w_t\|_{sp}$ usually increases during training. Therefore, $\|\nabla_{w_t} D_\theta(x)\|_F$ is typically upper bounded during training as well. The following proposition states that spectral normalization also gives an upper bound on $\|H_{w_i}(D_\theta)(x)\|_{sp}$ for networks with ReLU or leaky ReLU internal activations.
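
The scaling identities used above can be checked numerically. The sketch below is illustrative only: it assumes bias-free linear layers, a positive scalar c (the activation identity does not hold for negative c), and an arbitrary leaky ReLU slope. The helper names `linear`, `leaky_relu`, and `scale_vec` are hypothetical.

```python
def linear(w, x):
    """Bias-free linear map l_w(x) = w x, with w given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def leaky_relu(x, slope=0.25):
    """Leaky ReLU applied elementwise; slope=0.0 gives plain ReLU."""
    return [v if v >= 0 else slope * v for v in x]


def scale_vec(c, v):
    """Multiply every entry of a vector by the scalar c."""
    return [c * vi for vi in v]


w = [[1.0, -2.0], [0.5, 3.0]]
x = [2.0, 1.0]
z = [-1.0, 2.0]
c = 4.0  # assumed positive

cw = [scale_vec(c, row) for row in w]

# Scaling the weights or the input scales the output of a linear layer.
print(linear(cw, x), scale_vec(c, linear(w, x)), linear(w, scale_vec(c, x)))

# Piecewise-linear activations commute with positive scaling.
print(leaky_relu(scale_vec(c, z)), scale_vec(c, leaky_relu(z)))
```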







99f6a934a7cf277f2eaece8e3ce619b2-Supplemental.pdf

Neural Information Processing Systems

We use a variety of evaluation metrics to diagnose the effect that training with instance selection has on the learned distribution. In all cases where a reference distribution is required we use the original training distribution, and not the distribution produced after instance selection. Inception Score (IS) [24] evaluates samples by extracting class probabilities from an ImageNet-pretrained Inception-v3 classifier and measuring the distribution of outputs over all samples. Classification Accuracy Score (CAS) [23, 25] was introduced for evaluating the usefulness of conditional generative models for augmenting downstream tasks such as image classification.

Model      Params (M)  Batch Size  Retention Ratio (%)  IS     FID    P  R  D  C
BigGAN     52.54       512         100                  25.43  10.55  -  -  -  -
FQ-BigGAN  52.54       512         100                  25.96  9.67   -  -  -  -

The truncation trick is a simple and popular technique which is used to increase the visual fidelity of samples from a GAN at the expense of reduced diversity [2].
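
As a rough illustration of how the Inception Score mentioned above is computed, the sketch below uses the standard definition, the exponential of the average KL divergence between each sample's class distribution and the marginal class distribution. It is a sketch under stated assumptions: no classifier is run, and the per-sample class probabilities are made-up inputs standing in for Inception-v3 outputs. The function name `inception_score` is hypothetical.

```python
import math


def inception_score(probs):
    """Inception Score from per-sample class probabilities.

    probs: list of rows, each row a probability distribution over classes
    (here supplied by hand instead of by a pretrained classifier).
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all samples.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    total_kl = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)); zero-probability classes contribute nothing.
        total_kl += sum(
            pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0
        )
    return math.exp(total_kl / n)


# Identical, uninformative predictions give the minimum score.
flat = [[0.5, 0.5], [0.5, 0.5]]
# Confident predictions spread evenly over both classes give the maximum.
sharp = [[1.0, 0.0], [0.0, 1.0]]
print(inception_score(flat), inception_score(sharp))
```

With two classes the score ranges from 1 (every sample looks the same to the classifier) to 2 (confident predictions covering both classes equally), which shows how the metric rewards both sharp per-sample outputs and diversity across samples.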


4ffb0d2ba92f664c2281970110a2e071-Paper.pdf

Neural Information Processing Systems

The objective of GANs is to produce random samples from a target data distribution, given only access to an initial set of training samples. This is achieved by learning two functions: a generator G, which maps random input noise to a generated sample, and a discriminator D, which tries to classify input samples as either real (i.e., from the training dataset) or fake (i.e., produced by the generator).
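
The two roles described above can be expressed as a pair of losses. The snippet does not say which loss is used, so the sketch below assumes the common binary cross-entropy formulation with a non-saturating generator loss. The inputs are hypothetical discriminator outputs, interpreted as the probability that a sample is real, and the function names are illustrative.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for D: real samples labeled 1, generated samples 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating loss for G: low when D scores generated samples as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
print(discriminator_loss([0.5], [0.5]), generator_loss([0.5]))

# A discriminator that separates the two well has a lower loss,
# while the generator's loss on the same fakes is higher.
print(discriminator_loss([0.9], [0.1]), generator_loss([0.1]))
```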


TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up

Neural Information Processing Systems

The recent explosive interest in transformers has suggested their potential to become powerful "universal" models for computer vision tasks, such as classification, detection, and segmentation. While those attempts mainly study discriminative models, we explore transformers on some more notoriously difficult vision tasks, e.g., generative adversarial networks (GANs). Our goal is to conduct the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures. Our vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator that progressively increases feature resolution, and correspondingly a multi-scale discriminator to capture simultaneously semantic contexts and low-level textures. On top of them, we introduce the new module of grid self-attention for alleviating the memory bottleneck further, in order to scale up TransGAN to high-resolution generation. We also develop a unique training recipe including a series of techniques that can mitigate the training instability issues of TransGAN, such as data augmentation, modified normalization, and relative position encoding. Our best architecture achieves highly competitive performance compared to current state-of-the-art GANs using convolutional backbones.